Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
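As a rough illustration of how a cue-level metric of this kind could work, the sketch below embeds predicted and reference cue phrases, matches them one-to-one by semantic similarity, and reports an F1 score. The embedding model, the matching rule, and the 0.6 similarity threshold are assumptions for illustration, not the EmoCue-360 specification.

```python
# Hypothetical cue-matching metric in the spirit of an automated cue evaluator:
# embed cues, find a maximal one-to-one matching, and score matched pairs above
# a similarity threshold. Backbone and threshold are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cue_match_f1(pred_cues, ref_cues, threshold=0.6):
    if not pred_cues or not ref_cues:
        return 0.0
    emb = _encoder.encode(pred_cues + ref_cues, normalize_embeddings=True)
    p, r = emb[:len(pred_cues)], emb[len(pred_cues):]
    sim = p @ r.T                                # cosine similarities
    rows, cols = linear_sum_assignment(-sim)     # best one-to-one matching
    hits = int(np.sum(sim[rows, cols] >= threshold))
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred_cues), hits / len(ref_cues)
    return 2 * precision * recall / (precision + recall)

print(cue_match_f1(["furrowed brows", "trembling voice"],
                   ["frowning eyebrows", "shaky tone", "slumped posture"]))
```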
Hand gesture recognition (HGR) is a fundamental technology in human-computer interaction (HCI). In particular, HGR based on Doppler radar signals is suited for in-vehicle interfaces and robotic systems, which necessitate lightweight and computationally efficient recognition techniques. However, conventional deep learning-based methods still suffer from high computational costs. To address this issue, we propose an Echo State Network (ESN) approach for radar-based HGR using frequency-modulated continuous-wave (FMCW) radar signals. Raw radar data is first converted into feature maps, such as range-time and Doppler-time maps, which are then fed into one or more recurrent neural network-based reservoirs. The resulting reservoir states are processed by readout classifiers, including ridge regression, support vector machines, and random forests. Comparative experiments demonstrate that our method outperforms existing approaches on an 11-class HGR task using the Soli dataset and surpasses existing deep learning models on a 4-class HGR task using the Dop-NET dataset. The results indicate that parallel processing with multi-reservoir ESNs is effective for recognizing temporal patterns from multiple feature maps in the time-space and time-frequency domains. Our ESN approaches achieve high recognition performance with low computational cost in HGR, showing great potential for more advanced HCI technologies, especially in resource-constrained environments.
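A minimal echo state network for a single feature map might look like the sketch below: a fixed random reservoir unrolls over the time axis of the map and a ridge readout classifies the final state. Reservoir size, spectral radius, leak rate, and the stand-in data are illustrative, not the paper's settings.

```python
# Minimal single-reservoir ESN sketch with a ridge-regression readout.
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)

# Fixed random reservoir shared by all sequences (n_feat = feature-map width).
n_res, n_feat, rho, leak = 300, 32, 0.9, 0.3
w_in = rng.uniform(-0.5, 0.5, (n_res, n_feat))
w = rng.uniform(-0.5, 0.5, (n_res, n_res))
w *= rho / np.max(np.abs(np.linalg.eigvals(w)))     # scale spectral radius to rho

def reservoir_summary(x):
    """x: (T, n_feat) feature-map sequence -> final reservoir state (n_res,)."""
    h = np.zeros(n_res)
    for x_t in x:
        h = (1 - leak) * h + leak * np.tanh(w_in @ x_t + w @ h)
    return h

# Ridge readout trained on randomly generated stand-in sequences (4 classes).
X = np.stack([reservoir_summary(rng.standard_normal((40, n_feat))) for _ in range(64)])
y = rng.integers(0, 4, 64)
clf = RidgeClassifier(alpha=1.0).fit(X, y)
print(clf.score(X, y))
```

A multi-reservoir variant would run one such reservoir per feature map (range-time, Doppler-time, etc.) and concatenate the resulting states before the readout.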
Vision-language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and that the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of the original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces the attack success rate without compromising task performance.
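The general idea can be sketched as follows: span a low-rank subspace from embeddings of unsafe text concepts, score visual tokens by how strongly they project onto it, and amplify that component for the highest-risk tokens. The rank, gain, and token-selection rule below are assumptions for illustration, not RAI's actual design.

```python
# Illustrative sketch: build an "unsafe" subspace from language embeddings and
# amplify the unsafe-subspace component of selected high-risk visual tokens.
import numpy as np

def unsafe_subspace(unsafe_text_embs, rank=4):
    """unsafe_text_embs: (N, d) -> (rank, d) orthonormal basis via SVD."""
    centered = unsafe_text_embs - unsafe_text_embs.mean(0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]

def inject_risk_awareness(visual_tokens, basis, gain=2.0, top_k=8):
    """Boost the unsafe-subspace component of the top-k highest-risk tokens."""
    coords = visual_tokens @ basis.T              # (T, rank) projections
    risk = np.linalg.norm(coords, axis=1)         # per-token risk score
    idx = np.argsort(risk)[-top_k:]               # highest-risk visual tokens
    out = visual_tokens.copy()
    out[idx] += (gain - 1.0) * coords[idx] @ basis  # amplify unsafe component only
    return out

rng = np.random.default_rng(0)
tokens = inject_risk_awareness(rng.standard_normal((196, 64)),
                               unsafe_subspace(rng.standard_normal((50, 64))))
print(tokens.shape)
```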
Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that integrates a read/write policy network with monotonic chunkwise attention (MoChA) to dynamically segment speech embeddings. These segments are interleaved with label sequences during training, enabling seamless integration with the LLM. During inference, the audio stream is buffered until the MoChA module triggers a read signal, at which point the buffered segment, together with the previous token, is fed into the LLM for the next token prediction. We also introduce a minimal-latency training objective to guide the policy network toward accurate segmentation boundaries. Furthermore, we adopt a joint training strategy in which a non-streaming LLM-ASR model and our streaming model share parameters. Experiments on the AISHELL-1 and AISHELL-2 Mandarin benchmarks demonstrate that our method consistently outperforms recent streaming ASR baselines, achieving character error rates of 5.1% and 5.5%, respectively. The latency optimization results in a 62.5% reduction in average token generation delay with negligible impact on recognition accuracy.
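The inference loop described above can be illustrated with a toy sketch: frames accumulate in a buffer until the policy fires a read signal, then the buffered segment and the previous token are handed to the LLM for the next prediction. The policy and the LLM below are stand-in stubs (a fixed trigger interval and a dummy token generator), not the paper's models.

```python
# Toy streaming-decoding loop with stubbed policy network and LLM.
from collections import deque

def policy_should_read(buffer):          # stub for the MoChA-based read/write policy
    return len(buffer) >= 8              # e.g., trigger after every 8 frames

def llm_next_token(segment, prev_token): # stub for the decoder-only LLM
    return f"tok{len(segment)}"

def streaming_decode(frame_stream):
    buffer, prev_token, hypothesis = deque(), "<s>", []
    for frame in frame_stream:
        buffer.append(frame)
        if policy_should_read(buffer):
            prev_token = llm_next_token(list(buffer), prev_token)
            hypothesis.append(prev_token)
            buffer.clear()
    return hypothesis

print(streaming_decode(range(32)))
```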
Steady-state visual evoked potential (SSVEP)-based brain-computer interfaces (BCIs) are widely used due to their high signal-to-noise ratio and user-friendliness. Accurate decoding of SSVEP signals is crucial for interpreting user intentions in BCI applications. However, signal variability across subjects and the cost of user-specific annotation limit recognition performance. To address this, we propose a novel cross-subject domain adaptation method built upon the self-training paradigm. Specifically, a Filter-Bank Euclidean Alignment (FBEA) strategy is designed to exploit frequency information from SSVEP filter banks. We then propose a Cross-Subject Self-Training (CSST) framework consisting of two stages: Pre-Training with Adversarial Learning (PTAL), which aligns the source and target distributions, and Dual-Ensemble Self-Training (DEST), which refines pseudo-label quality. Moreover, we introduce a Time-Frequency Augmented Contrastive Learning (TFA-CL) module to enhance feature discriminability across multiple augmented views. Extensive experiments on the Benchmark and BETA datasets demonstrate that our approach achieves state-of-the-art performance across varying signal lengths, highlighting its superiority.
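Standard Euclidean alignment, applied separately within each filter-bank sub-band, would look roughly like the sketch below: each subject's trials are whitened by the inverse square root of their mean spatial covariance so that aligned trials from different subjects share an identity reference covariance. The filter-bank decomposition itself and any FBEA-specific weighting are omitted as assumptions.

```python
# Per-sub-band Euclidean alignment sketch for cross-subject SSVEP trials.
import numpy as np
from scipy.linalg import fractional_matrix_power

def euclidean_align(trials):
    """trials: (n_trials, n_channels, n_samples) EEG from one subject and sub-band."""
    covs = np.stack([x @ x.T / x.shape[1] for x in trials])   # spatial covariances
    r_inv_sqrt = fractional_matrix_power(covs.mean(0), -0.5)  # reference whitener
    return np.einsum("cd,ndt->nct", r_inv_sqrt, trials).real  # whiten every trial

rng = np.random.default_rng(0)
aligned = euclidean_align(rng.standard_normal((20, 9, 250)))  # 9-channel stand-in trials
print(aligned.shape)
```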
In audiovisual automatic speech recognition (AV-ASR) systems, fusing visual features into a pre-trained ASR model has proven to be a promising way to improve noise robustness. In this work, building on the prominent Whisper ASR, we first propose a simple and effective visual fusion method -- using visual features in both the encoder and the decoder (dual-use) -- to learn audiovisual interactions in the encoder and to weigh modalities in the decoder. Second, we compare visual fusion methods across Whisper models of various sizes. Our proposed dual-use method shows consistent noise robustness improvements, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to a typical middle-fusion reference in babble noise at a signal-to-noise ratio (SNR) of 0 dB. Third, we conduct ablation studies examining the impact of various module designs and fusion options. Fine-tuned on 1929 hours of audiovisual data, our dual-use method using Whisper medium achieves 4.08% (MUSAN babble noise) and 4.43% (NoiseX babble noise) average WER across various SNRs, thereby establishing a new state-of-the-art in noisy conditions on the LRS3 AV-ASR benchmark. Our code is at https://github.com/ifnspaml/Dual-Use-AVASR
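A generic fusion block of the kind that could be inserted into both the encoder and the decoder is sketched below: audio (or decoder) states cross-attend to projected visual features and a learned gate weighs the visual contribution. The gated-residual design and all dimensions are illustrative assumptions, not the paper's exact dual-use architecture.

```python
# Generic gated cross-attention block for injecting visual features (PyTorch).
import torch
import torch.nn as nn

class VisualFusionBlock(nn.Module):
    def __init__(self, d_model=768, d_visual=512, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_visual, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, visual):
        """x: (B, T, d_model) audio/decoder states; visual: (B, Tv, d_visual)."""
        v = self.proj(visual)
        attended, _ = self.attn(query=x, key=v, value=v)
        g = self.gate(torch.cat([x, attended], dim=-1))   # modality weighting
        return self.norm(x + g * attended)

block = VisualFusionBlock()
print(block(torch.randn(2, 100, 768), torch.randn(2, 25, 512)).shape)
```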
We design a gesture-recognition pipeline for networks of distributed, resource-constrained devices utilising Einsum Networks. Einsum Networks are probabilistic circuits that feature tractable inference, explainability, and energy efficiency. The system is validated in a scenario of low-power, body-worn, passive Radio Frequency Identification-based gesture recognition. Each constrained device includes task-specific processing units responsible for Received Signal Strength (RSS) and phase processing or Angle of Arrival (AoA) estimation together with feature extraction, as well as dedicated Einsum hardware that processes the extracted features. The outputs of all constrained devices are then fused in a decision aggregation module to predict gestures. Experimental results demonstrate that the method outperforms the benchmark models.
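The aggregation step could be as simple as the sketch below: each device reports per-class log-likelihoods from its probabilistic-circuit head, and the aggregator sums them (a product-of-experts style late fusion) before taking the argmax. The per-device models are abstracted away, and the summation rule is an assumption rather than the paper's fusion scheme.

```python
# Minimal late-fusion decision aggregation over per-device log-likelihoods.
import numpy as np

def aggregate_decisions(device_log_likelihoods):
    """device_log_likelihoods: (n_devices, n_classes) -> predicted class index."""
    fused = np.sum(device_log_likelihoods, axis=0)   # product of experts in log space
    return int(np.argmax(fused))

rng = np.random.default_rng(0)
print(aggregate_decisions(rng.standard_normal((3, 6))))   # 3 devices, 6 gesture classes
```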
Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was limited by its reliance on explicit face detectors, implicit fusion strategies, and small-scale, low-quality training data. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module enables simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2-Audio, Qwen2.5-VL, and GPT-4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.
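A pre-fusion module combining local and global interactions could be sketched as below: a depthwise convolution captures local structure and a self-attention layer captures global structure over the concatenated audio and visual tokens before they reach the LLM. The residual combination rule and all dimensions are assumptions for illustration, not the Conv Attention module's actual design.

```python
# Illustrative conv + attention pre-fusion over concatenated multimodal tokens.
import torch
import torch.nn as nn

class ConvAttentionPreFusion(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)  # depthwise, local
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # global
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_tokens, visual_tokens):
        x = torch.cat([audio_tokens, visual_tokens], dim=1)    # (B, Ta+Tv, d)
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)   # local interactions
        global_, _ = self.attn(x, x, x)                        # global interactions
        return self.norm(x + local + global_)

fusion = ConvAttentionPreFusion()
print(fusion(torch.randn(2, 50, 1024), torch.randn(2, 64, 1024)).shape)
```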
Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies to filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information alongside noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement, eliminating the need for explicit noise mask generation. This framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interactions, our method preserves speech semantic integrity to achieve robust recognition performance. Experimental evaluations on the public LRS3 benchmark suggest that our method outperforms prior advanced mask-based baselines under noisy conditions.
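One way to realize fusion through a narrow bottleneck is sketched below: a small set of learned bottleneck tokens gathers information from both modalities, and the noisy audio stream is then refined by attending to those tokens, so the modalities never exchange information directly. This is a generic attention-bottleneck stand-in, not the paper's Conformer-based module; all sizes are illustrative.

```python
# Generic attention-bottleneck fusion sketch for audio refined by video (PyTorch).
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_bottleneck=8):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, d_model) * 0.02)
        self.gather = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.refine = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio, video):
        b = self.bottleneck.expand(audio.size(0), -1, -1)
        ctx = torch.cat([audio, video], dim=1)
        b, _ = self.gather(b, ctx, ctx)          # bottleneck tokens read both modalities
        refined, _ = self.refine(audio, b, b)    # audio refined through the bottleneck
        return self.norm(audio + refined)

fusion = BottleneckFusion()
print(fusion(torch.randn(2, 120, 256), torch.randn(2, 30, 256)).shape)
```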
The received in-phase and quadrature (I/Q) baseband signals inherently encode physical-layer and channel characteristics of wireless links. Learning robust and transferable representations directly from such raw signals, however, remains challenging due to heterogeneous communication systems, diverse propagation environments, and limited labeled data. To address this, we present LWM-Spectro, a transformer-based foundation model pretrained on large-scale I/Q data represented as time-frequency spectrograms. The model leverages self-supervised masked modeling, contrastive learning, and a mixture-of-experts (MoE) architecture to learn general-purpose wireless representations. These representations transfer effectively to downstream tasks such as modulation classification and joint SNR/mobility recognition, even with minimal supervision. Across tasks, LWM-Spectro consistently outperforms state-of-the-art deep learning baselines in both few-shot and data-rich regimes, providing a unified foundation for wireless learning.
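The masked-modeling component of such pretraining can be illustrated with a toy sketch: random time-frequency patches of the spectrogram are masked, a small stand-in encoder predicts them, and the reconstruction loss is computed only on the masked positions. The patch size, mask ratio, and encoder are assumptions; the contrastive objective and MoE routing are omitted.

```python
# Toy masked-spectrogram reconstruction objective for I/Q spectrogram patches.
import torch
import torch.nn as nn

def masked_spectrogram_loss(spec_patches, encoder, mask_ratio=0.5):
    """spec_patches: (B, N, D) flattened time-frequency patches of an I/Q spectrogram."""
    b, n, _ = spec_patches.shape
    mask = torch.rand(b, n, device=spec_patches.device) < mask_ratio
    corrupted = spec_patches.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked patches
    pred = encoder(corrupted)
    return ((pred - spec_patches) ** 2)[mask].mean()               # loss on masked positions only

encoder = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))  # stand-in encoder
loss = masked_spectrogram_loss(torch.randn(4, 100, 64), encoder)
loss.backward()
print(float(loss))
```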